Add CLI sandbox runtime and fix HF Jobs script paths by lewtun · Pull Request #237 · huggingface/ml-intern

lewtun · 2026-05-07T08:53:01Z

Summary

Add an opt-in CLI sandbox tool runtime so local models can use HF Space sandbox tools, including sandbox_create.
Keep the default CLI runtime unchanged: bash, read, write, and edit operate on the local filesystem unless sandbox tools are explicitly enabled.
Treat bare Python filenames such as train_smollm2.py as sandbox script paths when resolving hf_jobs payloads.
Clarify prompt/tool guidance so hf_jobs.script uses inline code, sandbox files, or public/raw URLs, not host-local paths like /fsx/....

Why

This came up while testing local model workflows from #228. Small local models can create a training script, then call hf_jobs with script: "train_smollm2.py" or a host-local path. Previously bare filenames were not resolved from the sandbox, so the remote HF Job could fail with No such file or directory.

The same testing also showed that CLI local mode intentionally swaps sandbox tools for local filesystem tools, which means sandbox_create is unavailable by default. This PR makes the runtime explicit and opt-in instead of changing the default behavior.

Usage

Default local filesystem tools:

ml-intern --model llamacpp/ggml-org/gemma-3-1b-it-GGUF

Opt into HF Space sandbox tools:

ml-intern --model llamacpp/ggml-org/gemma-3-1b-it-GGUF --sandbox-tools

Or via config:

{
  "tool_runtime": "sandbox"
}

Sandbox mode requires an HF token even when the LLM itself is local, because it creates private HF Spaces.

Behavior

tool_runtime: "local" uses local bash/read/write/edit and excludes sandbox_create.
tool_runtime: "sandbox" uses sandbox bash/read/write/edit and includes sandbox_create.
Sandbox CLI sessions preload a cpu-basic sandbox and clean it up on shutdown.
Bare sandbox Python files like train_smollm2.py, ./train_smollm2.py, and /app/train_smollm2.py are read and submitted to HF Jobs as script content.
URLs such as https://example.com/train.py are still passed through as Jobs URLs.
Host-local paths such as /Users/..., /home/..., and /fsx/... are explicitly discouraged in the system prompt.

Example Failure Fixed

Before:

{"script": "train_smollm2.py"}

could fail remotely as:

Failed to spawn: train_smollm2.py
Caused by: No such file or directory

After this change, the CLI reads train_smollm2.py from the sandbox and submits the script content, matching the behavior for explicit paths like ./train_smollm2.py.

Tests

uv run pytest tests/unit
uv run ruff check .
uv run ruff format --check .

Follow-up to #228.

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-07T08:53:24Z

Claude finished @lewtun's task in 2m 4s —— View job

Review

LGTM — no blocking issues, 1 P2.

The widening of _looks_like_path is well-targeted and the URL bail-out at agent/tools/sandbox_tool.py:70-71 is necessary (a bare https://example.com/foo.py ends in .py and would otherwise be misrouted to sandbox.read). The not any(c.isspace()) guard on the new .py branch correctly keeps train_smollm2.py --epochs 1 and most one-line inline code out of the path bucket.

I traced both call sites:

agent/tools/jobs_tool.py:1260-1270 — server-side resolution before _resolve_uv_command.
agent/core/agent_loop.py:1546-1556 — best-effort pre-resolution for the approval UI; failures are silently discarded and resolution is retried in the handler. Both stay correct under the new behavior.

A bare filename that doesn't exist in the sandbox now produces a sandbox-read error instead of being passed through to remote uv run. That's not a regression — remote uv run train.py failed in the prior path too — but it's a small UX shift worth keeping in mind if anyone reports "I had a workflow relying on the file being staged remotely."

Tests cover the four shapes that matter (bare filename, nested relative, URL, name-with-args / inline). asyncio_mode = "auto" is set in pyproject.toml, so the @pytest.mark.asyncio decorators are redundant but harmless.

P2 — tool description out of date

agent/tools/jobs_tool.py:1159 still describes script as "Python code or sandbox file path (e.g. '/app/train.py') or URL." With this change, bare filenames like train_smollm2.py are also resolved from the sandbox. Worth mentioning so the model picks up the new affordance. Fix this →
• codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

lewtun · 2026-05-07T08:58:31Z

Addressed in 4be6975: updated the hf_jobs script parameter description to mention bare sandbox Python filenames like train.py, alongside absolute/relative paths and URLs. I also added a unit test that pins the model-facing description.

Checks:

uv run pytest tests/unit
uv run ruff check .
uv run ruff format --check .

github-actions · 2026-05-07T08:58:39Z

Claude finished @lewtun's task in 52s —— View job

Review

LGTM — no blocking issues.

The P2 from the prior review is addressed in 4be6975:

Description at agent/tools/jobs_tool.py:1159 now reads "Python code, sandbox file path (e.g. '/app/train.py', './train.py', or bare 'train.py'), or URL." — bare filenames are now surfaced to the model alongside absolute/relative paths and URLs.
New test test_hf_jobs_script_description_mentions_bare_python_filenames at tests/unit/test_sandbox_script_resolution.py:64-69 pins the description so future edits that drop the bare-filename affordance will fail loudly. Intentional tight string match — fine.

Nothing else changed in this commit; the resolution logic in agent/tools/sandbox_tool.py and the existing parametrized tests are unchanged from the prior review.
• codex/resolve-bare-python-job-scripts

github-actions · 2026-05-07T15:16:57Z

Claude finished @lewtun's task in 1m 42s —— View job

Review

LGTM — no blocking issues.

The new commit since the prior review at 4be6975 is just e91b538, a main merge. The PR's logical diff against origin/main (agent/tools/sandbox_tool.py:60-80, agent/tools/jobs_tool.py:1160, tests/unit/test_sandbox_script_resolution.py) is unchanged from the previous LGTM.

The merge brought in two orthogonal edits to PR-touched files from d7637ba (Pro entitlement fix):

agent/tools/jobs_tool.py:631-639 — credit-error copy now mentions Pro vs. credits separation; doesn't interact with script resolution.
agent/tools/sandbox_tool.py:36,204 — rename _SANDBOX_NAME_RE → SANDBOX_SPACE_NAME_RE and removal of a metadata_update call in _create_sandbox_locked; both unrelated to _looks_like_path / resolve_sandbox_script.

No conflict-resolution issues in the PR's resolver or tests. Couldn't run the test suite locally (sandbox didn't permit uv run), so trusting CI on tests/unit/test_sandbox_script_resolution.py.
• codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T07:59:58Z

Claude finished @lewtun's task in 2m 17s —— View job

Review

LGTM — no blocking issues.

The new commit since the prior review at e91b538 is 7b4687b ("Clarify HF Jobs script path prompt"), a purely additive change:

agent/prompts/system_prompt_v3.yaml:105-109 — adds an hf_jobs.script guidance block telling the model to never pass local-machine paths (/Users/..., /home/..., /fsx/..., or repo checkouts) and to use exactly one of: inline source, a sandbox file (/app/train.py, ./train.py, or bare train.py), or a URL. This is a sensible companion to the widened _looks_like_path heuristic — once /Users/lewtun/foo.py (matching script.startswith("/")) and bare filenames both route through sandbox.read, the failure surface for "I have a local file path" goes from "fails remotely with No such file or directory" to "fails locally with a sandbox-read error." The prompt heads that off rather than masking it.
tests/unit/test_sandbox_auto_start.py:38-45 — pins four substrings of the new prompt block (Never pass a local machine path to hf_jobs.script, /fsx/..., inline Python source code, a file already written in the session sandbox). Tight string match; intentional and fine.

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-80, agent/tools/jobs_tool.py:1160, tests/unit/test_sandbox_script_resolution.py) is unchanged from the prior LGTM at 4be6975. The merge in e91b538 was reviewed previously and brought in unrelated edits to PR-touched files; nothing further has shifted there.

Couldn't run the test suite locally (no uv run in this environment), so trusting CI on tests/unit/test_sandbox_auto_start.py and tests/unit/test_sandbox_script_resolution.py.
• codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T08:31:46Z

Claude finished @lewtun's task in 7m 56s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The new commit since the prior review at 7b4687b is e63b945 ("Add CLI sandbox tool runtime"). It widens the CLI from "always local-tools" to a runtime selector exposed as --sandbox-tools and the tool_runtime config field. The original PR-resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1160) is unchanged.

I traced the new control flow:

Tool runtime plumbing — agent/main.py:60-72 adds the override + label helpers; agent/main.py:984-988, 1228-1233 wires --sandbox-tools through _apply_tool_runtime_override to local_mode for both interactive and headless paths. agent/main.py:1024-1027, 1262-1264 and agent/main.py:1041, 1276 pass it consistently into ToolRouter and submission_loop. agent/core/session.py:99 now stores local_mode on the session — used by the shutdown handler.
HF token gate — agent/main.py:993-995, 1234-1240: not hf_token and (not is_local_model_id(...) or not local_mode). The four cells (hosted/local model × local/sandbox runtime) match what the new tests pin: only local_model + local_runtime skips the prompt. Correct.
Tool selection — agent/core/tools.py:389-395 already had the local/sandbox split; this commit just makes it dynamic per-session. tests/unit/test_sandbox_auto_start.py:55-65 pins that sandbox_create is excluded from local-runtime tools and present in sandbox-runtime tools.
Sandbox lifecycle — agent/core/agent_loop.py:2008-2009 preloads a CPU sandbox at submission_loop start when local_mode=False; agent/core/agent_loop.py:1933-1934 calls teardown_session_sandbox on OpType.SHUTDOWN. Both interactive and headless mains submit SHUTDOWN (agent/main.py:1191-1195, 1425-1429), so the happy path tears down the owned space. tests/unit/test_sandbox_auto_start.py:67-136 covers the start/teardown wiring.
Test coverage — solid: 4 cells of the token-prompt matrix (tests/unit/test_cli_rendering.py:147-225), --sandbox-tools flag forwarding for both entry points (tests/unit/test_cli_rendering.py:67-122), tool_runtime config validation including invalid values (tests/unit/test_config.py:127-156), banner string assertion (test_cli_rendering.py:131).

No P0 surface I can find. The pre-existing prompt-vs-mode tension (base prompt at agent/prompts/system_prompt_v3.yaml:107 says "use a sandbox file like /app/train.py" while the local-mode addendum at agent/context_manager/manager.py:241-249 says "do NOT use /app/ paths") is unchanged by this commit and was already informational on prior reviews; not re-flagging per re-review convergence.

Couldn't run the test suite locally (no uv run in this environment), so trusting CI on the new tests.
• Branch: codex/resolve-bare-python-job-scripts

github-actions · 2026-05-08T09:05:47Z

Claude finished @lewtun's task in 2m 8s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The new commit since the prior review at e63b945 is 77a806b, a main merge bringing in the /resume command (#233) and other unrelated changes from main. The PR's logical diff against origin/main is unchanged from the previous LGTM:

agent/tools/sandbox_tool.py:60-78 — widened _looks_like_path, unchanged.
agent/tools/jobs_tool.py:1160 — script description, unchanged.
agent/prompts/system_prompt_v3.yaml:103-108 — local-path warning block, unchanged.
agent/main.py — --sandbox-tools flag + _apply_tool_runtime_override + HF token gate, unchanged.
agent/core/agent_loop.py:1944-1947, 2032-2033 — sandbox preload/teardown wiring, unchanged.

I traced the merge to confirm no interaction with the PR's surface:

The merge added OpType.RESUME and Handlers.resume to agent/core/agent_loop.py (around line 1675/1977) and /resume plumbing to agent/main.py:905 — both orthogonal to the PR's local_mode plumbing. The shutdown path at agent_loop.py:1992-1993 still routes through Handlers.shutdown, which still gates teardown_session_sandbox on not session.local_mode. Both interactive and headless main still submit OpType.SHUTDOWN (agent/main.py:1311, 1545).
The merge added DEFAULT_SESSION_LOG_DIR to agent/core/session.py; the PR's self.local_mode = local_mode at line 99 sits next to it without conflict.
agent/core/session_resume.py (new from the merge) doesn't mutate local_mode or tool_runtime — resume restores message/event history only, so the runtime selected by the current process flag stands. Reasonable: a resumed sandbox-mode session brought into a local-mode invocation would just see references to unavailable tools in past messages, which is a UX nit, not a state-corruption bug, and not introduced by this PR.
agent/utils/terminal_display.py:459 (the /resume HELP_TEXT line) sits below the PR's banner change at line 119; no overlap.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_sandbox_auto_start.py, tests/unit/test_sandbox_script_resolution.py, tests/unit/test_cli_rendering.py, and tests/unit/test_config.py.
• Branch: codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T09:14:49Z

Claude finished @lewtun's task in 55s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The new commit since the prior review at 77a806b is 105ca83 ("Document CLI sandbox tool runtime"), a pure docs change. git diff 77a806b..HEAD is README.md only (+25/-0):

Adds a --sandbox-tools example to the existing CLI invocation block at README.md:66.
Adds a "CLI tool runtime" subsection at README.md:101-124 documenting the default-local behavior, the --sandbox-tools opt-in, the tool_runtime config field, the HF_TOKEN requirement under sandbox runtime, and when to use each runtime.

I spot-checked the docs against the code:

--sandbox-tools flag — agent/main.py:1590. ✓
tool_runtime config field with Literal["local", "sandbox"] default "local" — agent/config.py:49. ✓
Config path ~/.config/ml-intern/cli_agent_config.json — agent/config.py:65. ✓
"Sandbox tool runtime requires HF_TOKEN even when the selected model is local" — matches the gate at agent/main.py:993-995, 1234-1240 (only local_model + local_runtime skips the prompt), confirmed on the prior e63b945 review.

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1160, agent/prompts/system_prompt_v3.yaml:103-108, runtime plumbing in agent/main.py / agent/core/agent_loop.py) is unchanged from the prior LGTMs at 4be6975, 7b4687b, and e63b945.

Couldn't run the test suite locally (no uv run in this environment), but no test changes shipped in this commit so there's nothing new to verify there.
• Branch: codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T09:24:28Z

Claude finished @lewtun's task in 1m 34s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The new commit since the prior review at 105ca83 is b44a7de ("Wait for initial CLI sandbox preload"). Diff is agent/main.py (+16) and tests/unit/test_cli_rendering.py (+18) only.

I traced the wait against the preload's task lifecycle:

New helper _wait_for_initial_sandbox_preload at agent/main.py:76-87 awaits session.sandbox_preload_task under asyncio.shield, swallows non-cancel exceptions (with the comment that sandbox_tool surfaces stored errors on first use), and re-raises CancelledError. Matches the swallow-and-store contract in _preload at agent/tools/sandbox_tool.py:406-424 and the recovery path in get_active_or_preloaded_sandbox at agent/tools/sandbox_tool.py:459-478.
Called at agent/main.py:1193-1195 only after ready_event.wait() returns and only when not local_mode. Ordering is safe: submission_loop sets session_holder[0] = session at agent/core/agent_loop.py:2032 and immediately calls start_cpu_sandbox_preload(session) at agent/core/agent_loop.py:2034, which assigns session.sandbox_preload_task synchronously at agent/tools/sandbox_tool.py:427 before ready_event is set later. So by the time the helper runs, the task field is populated.
Headless mode (agent/main.py:1419-1429) does not add the wait, but the first sandbox-tool call goes through get_active_or_preloaded_sandbox, which performs the same asyncio.shield(task) await, so headless still works correctly without the prompt-blocking wait — appropriate for a non-interactive entry point.
Test test_initial_sandbox_preload_waits_before_prompt at tests/unit/test_cli_rendering.py:278-292 uses a real asyncio.create_task and asserts the task body actually completed before the helper returned. Tight and correct. Doesn't cover the exception-swallowing branch, but that branch is straightforward and the runtime path is exercised by existing test_sandbox_private_spaces.py cases.

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1160, agent/prompts/system_prompt_v3.yaml:103-108, runtime plumbing in agent/main.py / agent/core/agent_loop.py) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, and 105ca83.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on the new test.
• Branch: codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T10:14:24Z

Claude finished @lewtun's task in 1m 46s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The only new work since the prior LGTM at b44a7de is 6c6ac36, a merge of main bringing in #246 ("Fix GPU sandbox hardware OAuth failure"). No further commits touch the PR's logical surface.

I traced the merge against the PR's surface:

The merge dropped start_session_artifact_collection_task (removed in #246's hub_artifacts refactor) from agent/core/agent_loop.py — both the import and the call. The call previously sat between this PR's start_cpu_sandbox_preload(session) at agent/core/agent_loop.py:2033 and the logger.info("Agent loop started") line below it. Removal is clean and the PR's preload still fires under if not local_mode: at agent/core/agent_loop.py:2032-2033.
tests/unit/test_sandbox_auto_start.py lost a now-defunct monkeypatch.setattr(agent_loop, "start_session_artifact_collection_task", ...) line. The remaining test assertions on start_cpu_sandbox_preload / teardown_session_sandbox are unchanged and still cover the PR's wiring.
Teardown gate at agent/core/agent_loop.py:1947-1948 (if not getattr(session, "local_mode", False): await teardown_session_sandbox(session)) is intact — the merge didn't touch the shutdown handler.
Spot-checked grep for start_session_artifact_collection_task and hub_artifacts in PR-touched files: no stragglers.

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1160, agent/prompts/system_prompt_v3.yaml:103-108, runtime plumbing in agent/main.py, preload/teardown in agent/core/agent_loop.py:1948,2033) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, 105ca83, and b44a7de.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_sandbox_auto_start.py, tests/unit/test_sandbox_script_resolution.py, tests/unit/test_cli_rendering.py, and tests/unit/test_config.py.
• Branch: codex/resolve-bare-python-job-scripts

github-actions · 2026-05-08T11:52:40Z

Claude finished @lewtun's task in 1m 13s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The new commit since the prior LGTM at 6c6ac36 is 95c4cf6, a main merge bringing in #247 ("Fix stale tool error badges"). git diff 6c6ac36..HEAD is two frontend files only:

frontend/src/components/Chat/ToolCallGroup.tsx (+5/-2)
frontend/src/store/agentStore.ts (+6/-1)

Neither file is on the PR's logical surface, and no agent/tests/configs touched by this PR were modified by the merge (git diff --stat 6c6ac36..HEAD -- agent/ tests/ configs/ README.md is empty).

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1160, agent/prompts/system_prompt_v3.yaml:103-108, runtime plumbing in agent/main.py, preload/teardown in agent/core/agent_loop.py:1948,2033, initial-preload wait at agent/main.py:76-87,1193-1195) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, 105ca83, b44a7de, and 6c6ac36.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_sandbox_auto_start.py, tests/unit/test_sandbox_script_resolution.py, tests/unit/test_cli_rendering.py, and tests/unit/test_config.py.
• Branch: codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T12:21:24Z

Claude finished @lewtun's task in 1m 33s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The only new commit since the prior LGTM at 95c4cf6 is c6a618a ("Strengthen GPU sandbox preflight guidance"), a prompt/spec-only change with matching test pins. Touched files: agent/prompts/system_prompt_v3.yaml (+4/-1), agent/tools/jobs_tool.py (+4), tests/unit/test_sandbox_auto_start.py (+12), tests/unit/test_sandbox_script_resolution.py (+1).

What changed:

agent/prompts/system_prompt_v3.yaml:111 — new mandatory-GPU-preflight paragraph for hf_jobs calls when the job runs on GPU or the script touches CUDA / bf16/fp16 / quantization / flash attention / torch.compile / model loading. Requires sandbox_create (t4-small minimum), a tiny smoke test on the same imports + model-loading path + training entrypoint + tiny dataset, and an explicit justification when skipped.
agent/prompts/system_prompt_v3.yaml:116 — pre-flight checklist gains a "GPU sandbox smoke test" line.
agent/prompts/system_prompt_v3.yaml:139 — existing CPU-vs-GPU sandbox rule tightens its trigger list (adds fp16 / quantization / flash attention / torch.compile) and tells the model to test the largest useful smoke path + submit one HF job first when sandbox tiers can't fit the full model.
agent/tools/jobs_tool.py:1115-1117 — same GPU-preflight requirement added to the hf_jobs tool description.
agent/tools/jobs_tool.py:1162 — script parameter description gains "For GPU/model-loading training scripts, smoke-test in a GPU sandbox before submission."
Tests pin all the new strings — drift will fail loudly.

Consistency check against rest of prompt:

system_prompt_v3.yaml:137 ("Do NOT call sandbox_create before normal CPU work") still coheres — the new mandate fires only on GPU/model-loading work, which is exactly the case the existing rule says sandbox_create is for.
The pre-existing local-mode tension (agent/context_manager/manager.py:241-249 strips sandbox_create while the base prompt now mandates it for GPU jobs) is unchanged from prior reviews; was tagged informational then, not re-flagging per re-review convergence.

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1162, runtime plumbing in agent/main.py, preload/teardown wait at agent/main.py:76-87,1193-1195, and agent/core/agent_loop.py:1948,2034) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, 105ca83, b44a7de, 6c6ac36, and 95c4cf6.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_sandbox_auto_start.py and tests/unit/test_sandbox_script_resolution.py.
• Branch: codex/resolve-bare-python-job-scripts

github-actions · 2026-05-08T12:27:52Z

Claude finished @lewtun's task in 1m 54s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The only new commit since the prior LGTM at c6a618a is f14161a, a main merge bringing in #249 ("Use duplicate_repo for sandbox Spaces"). git diff c6a618a..HEAD is four files, all from the merge:

agent/tools/sandbox_client.py — api.duplicate_space(...) → api.duplicate_repo(repo_type="space", space_hardware=..., space_sleep_time=...); huggingface_hub floor bumped to >=1.12.0 in the inline script header (matched by pyproject.toml:16).
agent/utils/terminal_display.py — refactor of HELP_TEXT into a HELP_ROWS table with column-aligned format_help_text. Orthogonal to this PR's banner change at print_banner (the new tool_runtime argument).
tests/unit/test_cli_rendering.py — two help-rendering tests added by #249 for the new column-aligned output.
tests/unit/test_sandbox_private_spaces.py — fakes updated to duplicate_repo with the new kwarg names.

I confirmed the merge didn't touch the PR's logical surface: git diff c6a618a..HEAD -- agent/tools/sandbox_tool.py agent/tools/jobs_tool.py agent/prompts/system_prompt_v3.yaml agent/main.py agent/core/agent_loop.py agent/core/session.py agent/config.py configs/cli_agent_config.json README.md tests/unit/test_sandbox_auto_start.py tests/unit/test_sandbox_script_resolution.py tests/unit/test_config.py tests/unit/test_hub_artifacts.py is empty. The PR's net diff vs. origin/main does not include agent/tools/sandbox_client.py — it lives entirely in main now and the PR inherits the new API verbatim. Spot-checked grep duplicate_space in agent/ and tests/ — no stale references.

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1162, agent/prompts/system_prompt_v3.yaml:103-108,111, runtime plumbing in agent/main.py, preload/teardown wait at agent/main.py:76-87,1193-1195, and agent/core/agent_loop.py:1948,2034) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, 105ca83, b44a7de, 6c6ac36, 95c4cf6, and c6a618a.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_sandbox_auto_start.py, tests/unit/test_sandbox_script_resolution.py, tests/unit/test_cli_rendering.py, tests/unit/test_config.py, and tests/unit/test_sandbox_private_spaces.py.
• Branch: codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T12:51:47Z

Claude finished @lewtun's task in 2m 50s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The only new commit since the prior LGTM at f14161a is 072ae12 ("Guard against text-only stops with unfinished plans"). Touched files: agent/core/agent_loop.py (+85), agent/core/session.py (+1), agent/tools/plan_tool.py (+8/-4), tests/unit/test_no_tool_continuation_guard.py (+147 new). Orthogonal to the PR's CLI-sandbox / hf_jobs.script surface.

I traced the guard end-to-end:

Plan plumbing. agent/core/session.py:121 initializes current_plan: list[dict[str, str]] = [] per session. agent/tools/plan_tool.py:57-62 writes a per-session deep copy onto session.current_plan whenever plan_tool is called (in addition to the existing module-global _current_plan used by terminal rendering). Stored as a copy, so later mutation of the input doesn't bleed into session state. ✓
Guard trigger. agent/core/agent_loop.py:1345-1389 sits on the existing if not tool_calls: branch — shared by both streaming and non-streaming paths since both converge there. _unfinished_plan_items (agent_loop.py:48-57) returns items with status pending or in_progress. If the plan is empty (no plan ever created or all completed), the guard doesn't fire and the original break-with-final-response path runs.
Bounding. Retry limit is _NO_TOOL_INCOMPLETE_PLAN_RETRY_LIMIT = 2 (agent_loop.py:47). Counter is reset on every successful tool-call turn (agent_loop.py:1414), so a model that keeps making real progress (tool call → text → tool call) will get the budget back, but a model stuck in text-only stops gets at most 2 nudges before the loop returns to the user. Bounded, no infinite-loop surface.
Loop-state hygiene on retry. The retry path adds the assistant content (when non-empty) to context, then injects the [SYSTEM: CONTINUATION GUARD] user message, then iteration += 1; continue. Top-of-loop checks (cancellation at agent_loop.py:1201, compaction at 1210, doom-loop at 1215, malformed-tool detection at 1221) all run before the next LLM call. Same pattern as the existing recovery prompts above it. ✓
Resume note. Session.__init__ re-initializes current_plan = [] on resume; agent/core/session_resume.py doesn't restore plan state. So a resumed conversation whose history mentions an in-progress plan won't trigger the guard until the model re-issues plan_tool. Pre-existing behavior, not introduced here.
Tests. tests/unit/test_no_tool_continuation_guard.py:11-32, 36-58 exercise both the plan-tool side-effect and the retry path end-to-end (real Session, fake LLM that returns text → continuation prompt → plan_tool with all-completed → "Done."). Asserts on the injected prompt substring ("CONTINUATION GUARD") and the tool_log event. The retry-limit-exhaustion path isn't directly pinned, but the limit is a constant and the condition is straightforward — fine.

The PR's logical resolution code (agent/tools/sandbox_tool.py:60-78, agent/tools/jobs_tool.py:1162, agent/prompts/system_prompt_v3.yaml:103-108,111, runtime plumbing in agent/main.py, preload/teardown wait at agent/main.py:76-87,1193-1195, and agent/core/agent_loop.py:1948,2034) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, 105ca83, b44a7de, 6c6ac36, 95c4cf6, c6a618a, and f14161a.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_no_tool_continuation_guard.py.
• Branch: codex/resolve-bare-python-job-scripts

Co-authored-by: OpenAI Codex <codex@openai.com>

github-actions · 2026-05-08T13:02:19Z

Claude finished @lewtun's task in 2m 8s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The only new commit since the prior LGTM at 072ae12 is c9d05ee ("Route sandbox deletion logs through tool events"). Diff: agent/tools/sandbox_client.py (+5/-3), agent/tools/sandbox_tool.py (+22/-9), tests/unit/test_sandbox_private_spaces.py (+121/-6), tests/unit/test_session_manager_persistence.py (+1/-1). Orthogonal to the PR's hf_jobs.script resolution surface.

I traced the refactor end-to-end:

Signature change. Sandbox.delete() at agent/tools/sandbox_client.py:779 now takes an optional log: Callable[[str], object] | None. The two print() calls at lines 786 and 794 were replaced with if log: log(...). Backward-compatible — every existing caller that doesn't pass log (the __enter__/__exit__ path at agent/tools/sandbox_client.py:822, the docstring example at :27, and tests/integration/test_live_sandbox_auth.py:95) continues to work, just silently. The error path at :824 still goes to stderr via print, which is fine for the library context manager use case.
Helper. _session_tool_logger at agent/tools/sandbox_tool.py:62-77 returns a thread-safe Callable[[str], object] that posts tool_log events via loop.call_soon_threadsafe(event_queue.put_nowait, ...). Returns None if the session has no event_queue. Both call sites are inside async functions, so asyncio.get_running_loop() is safe; session.event_queue is asyncio.Queue per agent/core/session.py:84,117, so put_nowait is correct.
Call sites.
- _create_sandbox_locked at agent/tools/sandbox_tool.py:332 — replaces the inline thread-safe log closure with the helper, falling back to a no-op lambda if event_queue is missing. The cancellation cleanup at :374 passes that callback into sb.delete(log=_log).
- teardown_session_sandbox at agent/tools/sandbox_tool.py:519,528 — calls the helper (no fallback needed, since delete() accepts None) and passes through.
Event consumption. tool_log events with tool: "sandbox" are already handled by both the headless path at agent/main.py:1471-1506 and the interactive path at agent/main.py:441-447, both routing through print_tool_log(tool, log). Same plumbing as the existing sandbox_create tool_log events at agent/tools/sandbox_tool.py:138. No new consumer needed.
Tests. Solid coverage:
- test_sandbox_delete_uses_log_callback_without_stdout at tests/unit/test_sandbox_private_spaces.py:94-115 pins both the callback contract and that capsys sees no stdout/stderr.
- test_cancelled_sandbox_creation_logs_delete_through_tool_log at tests/unit/test_sandbox_private_spaces.py:423-484 exercises the cancellation cleanup path end-to-end with a real asyncio.Queue and asserts the two tool_log events fire with the correct tool: "sandbox" data.
- The existing teardown test at tests/unit/test_sandbox_private_spaces.py:651-723 now also asserts the events.
- The two unit tests that build SimpleNamespace sandbox stubs (test_sandbox_create_replaces_auto_cpu_sandbox and test_close_cancels_preload_and_deletes_owned_sandbox) updated their delete=lambda: ... to delete=lambda log=None: ... to match the new signature.

The PR's logical resolution code (agent/tools/sandbox_tool.py:80-100 widened _looks_like_path, agent/tools/jobs_tool.py:1162 script description, agent/prompts/system_prompt_v3.yaml:103-108,111 prompt guidance, runtime plumbing in agent/main.py, preload/teardown wait at agent/main.py:76-87,1193-1195) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, 105ca83, b44a7de, 6c6ac36, 95c4cf6, c6a618a, f14161a, and 072ae12.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_sandbox_private_spaces.py and tests/unit/test_session_manager_persistence.py.
• Branch: codex/resolve-bare-python-job-scripts

github-actions · 2026-05-08T18:10:57Z

Claude finished @lewtun's task in 1m 4s —— View job

Re-review: PR #237

LGTM — no blocking issues.

The only new commit since the prior LGTM at c9d05ee is 3501d46, a main merge bringing in #250 ("Use hyphenated ml-intern Trackio prefixes"). git diff c9d05ee..HEAD is four files, all from the merge:

agent/prompts/system_prompt_v3.yaml:69 — mlintern-<8-char-id> → ml-intern-<8-char-id> in the trackio guidance.
agent/tools/jobs_tool.py:1122,1211 — same string change in the hf_jobs tool spec and the trackio_space_id parameter description.
agent/tools/sandbox_tool.py:565,586 — same string change in the sandbox_create tool spec and its trackio_space_id parameter description.
tests/unit/test_trackio_space_ids.py (+16) — pins the new prefix.

Pure brand/string change. grep -r mlintern agent/ tests/ configs/ README.md returns nothing — no stale references. The PR's logical resolution code (agent/tools/sandbox_tool.py:80-100 widened _looks_like_path, agent/tools/jobs_tool.py:1162 script description, agent/prompts/system_prompt_v3.yaml:103-108,111 prompt guidance, runtime plumbing in agent/main.py, preload/teardown wait at agent/main.py:76-87,1193-1195) is unchanged from the prior LGTMs at 4be6975, 7b4687b, e63b945, 77a806b, 105ca83, b44a7de, 6c6ac36, 95c4cf6, c6a618a, f14161a, 072ae12, and c9d05ee.

Couldn't run the suite locally (no uv run in this environment), so trusting CI on tests/unit/test_trackio_space_ids.py and the rest of the unit tests touched by this PR.
• Branch: codex/resolve-bare-python-job-scripts

Resolve bare Python job scripts

a1ce24b

Co-authored-by: OpenAI Codex <codex@openai.com>

Update HF Jobs script description

4be6975

Co-authored-by: OpenAI Codex <codex@openai.com>

Merge branch 'main' into codex/resolve-bare-python-job-scripts

e91b538

Clarify HF Jobs script path prompt

7b4687b

Co-authored-by: OpenAI Codex <codex@openai.com>

Add CLI sandbox tool runtime

e63b945

Co-authored-by: OpenAI Codex <codex@openai.com>

Merge branch 'main' into codex/resolve-bare-python-job-scripts

77a806b

Document CLI sandbox tool runtime

105ca83

Co-authored-by: OpenAI Codex <codex@openai.com>

Wait for initial CLI sandbox preload

b44a7de

Co-authored-by: OpenAI Codex <codex@openai.com>

lewtun changed the title ~~Resolve bare Python job scripts for HF Jobs~~ Add CLI sandbox runtime and fix HF Jobs script paths May 8, 2026

Merge main into CLI sandbox runtime branch

6c6ac36

Co-authored-by: OpenAI Codex <codex@openai.com>

Merge branch 'main' into codex/resolve-bare-python-job-scripts

95c4cf6

Strengthen GPU sandbox preflight guidance

c6a618a

Co-authored-by: OpenAI Codex <codex@openai.com>

Merge branch 'main' into codex/resolve-bare-python-job-scripts

f14161a

Guard against text-only stops with unfinished plans

072ae12

Co-authored-by: OpenAI Codex <codex@openai.com>

Route sandbox deletion logs through tool events

c9d05ee

Co-authored-by: OpenAI Codex <codex@openai.com>

Merge branch 'main' into codex/resolve-bare-python-job-scripts

3501d46

Conversation

lewtun commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Why

Usage

Behavior

Example Failure Fixed

Tests

Uh oh!

github-actions Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

P2 — tool description out of date

Uh oh!

lewtun commented May 7, 2026

Uh oh!

github-actions Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

Uh oh!

github-actions Bot commented May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

github-actions Bot commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Re-review: PR #237

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

lewtun commented May 7, 2026 •

edited

Loading

github-actions Bot commented May 7, 2026 •

edited

Loading

github-actions Bot commented May 7, 2026 •

edited

Loading

github-actions Bot commented May 7, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading

github-actions Bot commented May 8, 2026 •

edited

Loading